Abstract


A central goal of explainable artificial intelli- 
gence (XAI) is to improve the trust relationship 
in human-AI interaction. One assumption under- 
lying research in transparent AI systems is that 
explanations help to better assess predictions of 
machine learning (ML) models, for instance by 
enabling humans to identify wrong predictions 
more efficiently. Recent empirical evidence, however, shows
that explanations can have the opposite effect: when presented
with explanations of ML predictions, humans often tend to trust
those predictions even when they are wrong. Experimental
evidence suggests that this effect can be attributed 
to how intuitive, or human, an AI or explanation 
appears. This effect challenges the very goal of 
XAI and implies that responsible usage of trans- 
parent AI methods has to consider the ability of 
humans to distinguish machine generated from 
human explanations. Here we propose a quanti- 
tative metric for XAI methods based on Turing’s 
imitation game, a Turing Test for Transparency. 
A human interrogator is asked to judge whether 
an explanation was generated by a human or by 
an XAI method. Explanations of XAI methods
that humans cannot detect above chance
performance in this binary classification task
pass the test. Detecting such explanations is
a requirement for assessing and calibrating the 
trust relationship in human-AI interaction. We 
present experimental results on a crowd-sourced 
text classification task demonstrating that even 
for basic ML models and XAI approaches most 
participants were not able to differentiate human 
from machine generated explanations. We discuss 
ethical and practical implications of our results 
for applications of transparent ML. 
1. Introduction


Machine learning (ML) systems are increasingly being used 
to automate or assist decision making. The growing com- 
plexity of ML systems is accompanied by an increased 
demand for transparency in automated decision making, 
both for technical and for ethical reasons. Transparency of 
an ML system can help to debug ML models (Lapuschkin 
et al., 2019). Interpretable ML can also increase trust in
ML technology in human-AI interaction, for instance when
revealing biases of a trained model (Phillips et al., 2018). 


A common goal of XAI methods is to generate explanations 
that render ML predictions comprehensible to humans. The 
underlying assumption is that understanding ML predictions 
is a requirement for trusting ML predictions. The quality of 
explanations can be empirically evaluated based on the con- 
cept of simulatability, meaning how helpful an explanation 
is for humans to replicate the ML prediction (Lipton, 2016; 
Poursabzi-Sangdeh et al., 2018; Hase & Bansal, 2020). 


Empirical evaluation of XAI quality based on simulatability 
demonstrates that explanations often do not contribute to 
a well calibrated trust relationship in human-AI collaboration¹:
explanations often do not help to improve simulatability
(Hase & Bansal, 2020) and do not help to identify wrong AI
predictions (Poursabzi-Sangdeh et al., 2018). These findings
challenge some of the assumptions of XAI research: if explanations
do not enable humans to identify wrong predictions and do not
improve interpretability, which metric should XAI researchers
strive to optimize?


One observation that was made in studies investigating the 
human-AI trust relationship is that humans tend to trust hu- 
mans more than AI systems — even when humans are known 
to perform worse than the AI (Dietvorst et al., 2015). This 
effect is also observed in XAI studies, in which researchers 
find that humans would follow wrong transparent AI predic- 
tions, if the explanation appears intuitive or human (Schmidt 
et al., 2020). These findings suggest that controlling for how 
intuitive or human an explanation appears is an important — 
and so far underrepresented — aspect of machine generated 
explanations. Here we present, to the best of our knowledge,
the first approach to measure this aspect of machine generated
explanations. In analogy to the idea put forward by
Alan Turing, which he himself referred to as the imitation
game and which is now known as the Turing Test (Turing, 1950), we
propose a Turing Test for Transparency. As in the
original imitation game, we assume that there is a human
interrogator who is presented with the output of an ML model
or a human. Unlike in the original imitation game, we
also provide an explanation for the prediction made by the
computer or the human subject. The task of the interrogator
is to decide whether the explanation was generated by
a human or a computer. The experiment is illustrated in
Figure 1.


¹Well calibrated here refers to a trust relationship in which
neither blind trust, meaning humans follow (transparent) AI predictions
regardless of whether they are correct or wrong, nor ignorance
of useful AI advice occurs.


Figure 1. Turing Test for Transparency proposed in this study,
in analogy to Turing’s imitation game. Human interrogators are
asked to differentiate human from machine generated explanations
in a decision making task. If humans perform at chance level, the
machine generated explanations pass the test.


2. Experiments 


We conducted two experiments on the crowd-working platform
Amazon Mechanical Turk, one for data collection and one for
the actual Turing Test. We recruited 145 workers; each worker
participated in both experiments after passing a simple bot
detection mechanism².


²Participants were asked what the annotation task is about and
qualified as human if they chose the correct one out of three
possible answers.


2.1. Data Set, Classification Model and XAI Method


We used a publicly available IMDb movie review sentiment
dataset originally introduced by Maas et al. (2011)
and selected a set of reviews for which the ML model’s
classification accuracy was 80%. We used a unigram bag-of-words
feature extractor followed by term-frequency inverse
document frequency normalization. The sparse feature vectors
were then fed into an L2 regularized logistic regression
model trained using stochastic gradient descent.
The regularization was optimized by grid search using
scikit-learn (Pedregosa et al., 2011). This basic text classification
pipeline was trained on 25,000 movie reviews and
achieved precision/recall/F1 scores of 0.87 on a test set of
another 25,000 reviews. Explanations for each review were
computed based on covariance of unigram features with the 
class likelihood (Schmidt & Biessmann, 2019). 


2.2. Experiment 1: Collecting Human Explanations 


In order to compare explanations of humans and the ML 
model we acquired five samples of human explanations 
and annotations for movies randomly drawn from the set 
of 50 movie reviews. Subjects were asked to classify the 
reviews as positive or negative and mark the three words 
most relevant for their decision. 


2.3. Experiment 2: Turing Test for Transparency 


Each subject was shown five movie reviews, again drawn
at random from those reviews they had not seen in the
first experiment. In contrast to the first experiment sub- 
jects were also shown the prediction of a human or the ML 
model along with the respective explanation for the predic- 
tion. Both human and machine generated explanations were 
represented by highlighting the three words most relevant 
for the prediction. 


3. Results 


After all 145 subjects performed both experiments we fil- 
tered out participants who did not achieve an annotation 
accuracy of at least 60%, which reduced the number of 
subjects to 133. 


3.1. Results Turing Test for Transparency 


We first quantified whether humans were able to distinguish 
human generated from AI generated explanations. Note that 
we discarded those human generated explanations obtained 
in experiment 1 for which human annotations were incorrect.
The results are shown in Table 2 as standard classification
metrics: precision, recall, and F1 score for both human and AI
generated explanations, and accuracy aggregated across both
classes.
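

As a minimal sketch of how such metrics can be derived from the interrogators' judgments, the following Python snippet computes per-class precision, recall, and F1 score together with the overall accuracy. The function name `classification_metrics` and the toy judgments are illustrative assumptions, not the evaluation code or the data of this study.

```python
from collections import Counter


def classification_metrics(true_sources, judged_sources, classes=("human", "ML model")):
    """Per-class precision, recall, F1 and support, plus overall accuracy."""
    if not true_sources or len(true_sources) != len(judged_sources):
        raise ValueError("need two non-empty label lists of equal length")
    # Count every (true source, judged source) combination once.
    pairs = Counter(zip(true_sources, judged_sources))
    report = {}
    for c in classes:
        tp = pairs[(c, c)]
        judged_c = sum(n for (t, j), n in pairs.items() if j == c)
        true_c = sum(n for (t, j), n in pairs.items() if t == c)
        precision = tp / judged_c if judged_c else 0.0
        recall = tp / true_c if true_c else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[c] = {"precision": precision, "recall": recall,
                     "f1": f1, "support": true_c}
    correct = sum(n for (t, j), n in pairs.items() if t == j)
    report["accuracy"] = correct / len(true_sources)
    return report


# Hypothetical toy judgments (assumption), not data from the study.
true_sources = ["human", "human", "human", "ML model", "ML model", "ML model"]
judged_sources = ["human", "human", "ML model", "human", "ML model", "ML model"]
report = classification_metrics(true_sources, judged_sources)
```

In this toy example each class has two correct and one incorrect judgment, so precision, recall, and F1 score are 2/3 for both classes and the accuracy is 4/6.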


XAI Explanations pass Turing Test. All metrics demonstrate
that in this particular experiment humans were not
able to differentiate human from AI explanations: human
annotators performed around chance level in the Turing Test.
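

One way to make the notion of chance performance concrete is an exact one-sided binomial test on the number of correct judgments. The sketch below is an assumption about how such a check could be implemented; it is not the analysis reported in this study, and it treats all judgments as independent, which ignores that each subject contributed several judgments. The function name and the example counts are hypothetical.

```python
from math import comb


def binomial_p_above_chance(n_correct, n_trials, chance=0.5):
    """Exact one-sided p-value P(X >= n_correct) for X ~ Binomial(n_trials, chance)."""
    if not 0 <= n_correct <= n_trials:
        raise ValueError("n_correct must lie between 0 and n_trials")
    return sum(
        comb(n_trials, k) * chance**k * (1 - chance) ** (n_trials - k)
        for k in range(n_correct, n_trials + 1)
    )


# Hypothetical counts (assumption): 5 correct judgments out of 10 trials.
p_value = binomial_p_above_chance(5, 10)
explanations_pass = p_value >= 0.05  # detection not above chance at the 5% level
```

Under this criterion, explanations pass the test when the null hypothesis of chance-level detection cannot be rejected. For several thousand trials the floating point powers may underflow, in which case a normal approximation or a statistics library would be preferable.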
Table 1. Comparison of explanations generated by human subjects and AI. Shown are five words for each set (selected by humans and
not AI, selected by AI but not by humans and selected by both humans and AI), randomly drawn for positive and negative class after
controlling for the number of words in human and AI explanations. Qualitative comparison suggests that the explanations generated by
humans and the XAI method used are similar with respect to semantic and syntactic features.


                  positive                                       negative
Humans \ AI     AI \ Human     Humans ∩ AI     Humans \ AI     AI \ Human      Humans ∩ AI
extraordinary   human          magnificent     nonsensical     disappointing   dull
impressed       family         brilliant       unfunny         avoid           horrible
recommend       story          enjoyed         disappointed    thing           ridiculous
hilarious       lovable        excellent       recommended     reason          worst
interestingly   relationship   superb          sick            premise         awful


Table 2. Results of Turing Test for Transparency. Shown are preci- 
sion, recall, F1 score and accuracy for the task of distinguishing 
human generated explanations from machine generated explana- 
tions. All metrics indicate that humans were not able to distinguish 
human and AI explanations. This can be interpreted as the machine 
generated explanations passing the Turing Test. 


               precision   recall   f1-score   support
ML model       0.49        0.50     0.49       311
human          0.56        0.56     0.56       359
accuracy       0.53        0.53     0.53
weighted avg   0.53        0.53     0.53       670


Figure 2. Results of the Turing Test for Transparency. Shown is a
histogram of the accuracies with which human subjects correctly
classified human vs. AI explanations. On average humans performed
around chance with an overall accuracy of 53%, indicating that the
explanations provided by the AI passed the Turing Test.


We further investigated the distribution of accuracies in the 
Turing Test across subjects to assess whether there is any 
systematic pattern or bias in the set of participants sampled. 
In Figure 2 we show the distribution of accuracies in the 
Turing Test across all subjects. There seems to be no partic- 
ular bias in the subject sample and the overall distribution 
appears to be close to a normal distribution centered around 
0.53. 


Annotation accuracy vs. Turing Test accuracy:
Grouped by subjects. One might hypothesize that the
variability in participants’ ability to differentiate human from
XAI explanations is due to cognitive abilities that also enable
subjects to perform well in the annotation task itself.
To investigate this relationship, we computed the correlation
between the annotation accuracy of each annotator
and their performance in the Turing Test. The correlation
was 0.14, indicating that there is no strong dependency between
task performance and Turing Test performance. This
weak relationship could be explained by different cognitive mechanisms
governing human decisions in the two tasks.
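

The type of correlation coefficient is not specified here; the following sketch assumes Pearson's r and shows how it could be computed from one pair of accuracies per subject. The accuracy values are hypothetical placeholders, not measurements from this study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


# Hypothetical per-subject accuracies (assumption), one entry per subject.
annotation_accuracy = [0.6, 0.8, 1.0, 0.8, 0.6]
turing_test_accuracy = [0.4, 0.6, 0.6, 0.4, 0.6]
r = pearson_r(annotation_accuracy, turing_test_accuracy)
```

The same function applies when the data are grouped by reviews instead of subjects; only the unit over which the accuracies are averaged changes.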


Annotation accuracy vs. Turing Test accuracy:
Grouped by reviews. When grouping the data by movie
reviews we see a slightly stronger but negative correlation 
of -0.29, as shown in Figure 3. This trend indicates that 
human generated explanations for reviews that were often 
annotated correctly in the original classification task tended 
to fail the Turing Test for Transparency more often. In contrast,
human generated explanations for movie reviews which
were often annotated incorrectly were easier to differentiate
from machine generated explanations. Note that
all human generated explanations for which human annota- 
tors provided a wrong annotation were discarded from the 
Turing Test for Transparency. 


Figure 3. The accuracy of human annotators in the binary movie
review annotation task shows a slightly negative correlation with
the accuracy of human annotators in the Turing Test for Transparency.
Explanations corresponding to movie reviews that were
correctly annotated were more difficult to distinguish from machine
generated explanations. Note that human generated explanations
for reviews that were incorrectly annotated by humans were
discarded.


Turing Test Accuracy: Humans worse than AI. We
trained the same model used for the original text classification
task on the binary classification task of the Turing
Test for Transparency. For each movie review the human
or machine generated explanation was the input to the ML 
model and the binary label was human generated or machine 
generated. Model training was done for different training 
set sizes to estimate the sample efficiency of the model. In 
Figure 4 we show the human accuracy as a baseline and the 
10th/50th/90th quantile of a set of ML models trained on 
equally sized independently sampled training data sets. The 
number of ML models trained for each training set size was 
chosen to match the number of subjects. 
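

The procedure described above can be sketched as follows. The snippet is a hedged illustration rather than the code used in this study: `train_and_score` is a hypothetical callable that trains a classifier on a list of examples and returns its accuracy on held-out data, and `toy_train_and_score` is a stand-in so that the sketch runs without any ML library.

```python
import random
import statistics


def accuracy_quantiles(train_and_score, pool, train_sizes, n_models, seed=0):
    """Return {train_size: (q10, q50, q90)} over n_models independent runs."""
    rng = random.Random(seed)
    curve = {}
    for size in train_sizes:
        accuracies = [
            train_and_score(rng.sample(pool, size))  # fresh training set per model
            for _ in range(n_models)
        ]
        # Nine cut points at 10%, 20%, ..., 90%. The "inclusive" method
        # interpolates between observed values and never extrapolates.
        deciles = statistics.quantiles(accuracies, n=10, method="inclusive")
        curve[size] = (deciles[0], deciles[4], deciles[8])
    return curve


def toy_train_and_score(sample):
    """Toy stand-in (assumption): accuracy grows with the training set size."""
    noise = (sum(sample) % 7) / 100
    return min(1.0, 0.5 + len(sample) / 200 + noise)


curve = accuracy_quantiles(
    toy_train_and_score, pool=list(range(100)), train_sizes=[10, 30, 50], n_models=20
)
```

Setting `n_models` to the number of human subjects mirrors the design choice of comparing two accuracy distributions of equal size.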


Our results show that for small training set sizes the ML
models did not achieve accuracies significantly higher than
those of human subjects. For larger training set sizes, however,
the ML models appeared to achieve slightly but significantly
higher accuracies than humans. We tested
for significance using a Kruskal-Wallis test comparing
the two distributions of human and ML model
accuracies and corrected for multiple comparisons using
Bonferroni correction. These results show that in this particular
example, ML models can better differentiate human
from machine generated explanations when the training set
size is larger than 30 reviews. One could argue that the
ML models only achieve significantly higher accuracies
when seeing more training data than human subjects did in
our experiment. On the other hand, the implicit assumption
of the Turing Test is that humans have seen so many
training examples, at least of the human generated class,
that they know what human explanations look like.
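

A minimal sketch of such a comparison is given below, assuming one Kruskal-Wallis test per training set size followed by Bonferroni correction across sizes. It implements the two-sample case with tie correction and the usual chi-square approximation with one degree of freedom; a tested implementation such as `scipy.stats.kruskal` would normally be preferred. All accuracy values are hypothetical and do not stem from this study.

```python
from math import erfc, sqrt


def kruskal_wallis_two_groups(a, b):
    """Tie-corrected H statistic and chi-square (df=1) p-value for two samples."""
    pooled = sorted((v, g) for g, sample in enumerate((a, b)) for v in sample)
    n = len(pooled)
    rank_sums = [0.0, 0.0]
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the ranks i+1, ..., j
        for _, g in pooled[i:j]:
            rank_sums[g] += avg_rank
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    h = 12 / (n * (n + 1)) * (
        rank_sums[0] ** 2 / len(a) + rank_sums[1] ** 2 / len(b)
    ) - 3 * (n + 1)
    correction = 1 - tie_term / (n**3 - n)
    if correction == 0:
        raise ValueError("all values are identical")
    h = max(h / correction, 0.0)  # guard against tiny negative rounding errors
    return h, erfc(sqrt(h / 2))  # survival function of chi-square with df=1


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    return [min(1.0, p * len(p_values)) for p in p_values]


# Hypothetical accuracies (assumption), not data from the study.
human_acc = [0.4, 0.6, 0.4, 0.6, 0.5, 0.5]
model_acc_by_size = {
    10: [0.5, 0.4, 0.6, 0.5, 0.6, 0.4],
    50: [0.7, 0.8, 0.7, 0.9, 0.8, 0.7],
}
raw = {s: kruskal_wallis_two_groups(human_acc, acc)[1]
       for s, acc in model_acc_by_size.items()}
adjusted = dict(zip(raw, bonferroni(list(raw.values()))))
significant = {s: p < 0.05 for s, p in adjusted.items()}
```

In this toy example only the larger training set size yields a significant difference after correction, because the hypothetical model accuracies at size 10 have the same distribution as the human accuracies.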
Figure 4. Accuracy in the Turing Test for Transparency achieved by an AI and by human
subjects. Shown are the human accuracy (black dashed line) and
the accuracy (10th/50th/90th quantile) of an AI for varying training
set size (shown on x-axis). Stars indicate the training set sizes for
which the accuracy distribution over ML models was significantly
different from the Turing Test accuracy distribution across all
human subjects (p < 0.05, Bonferroni corrected).


4. Conclusion 


A common assumption of XAI research is that explanations 
increase trust in human-AI collaboration by making it easier
to spot wrong predictions (Lapuschkin et al., 2019) and to 
speed up and improve the confirmation of correct ML predic- 
tions (Schmidt & Biessmann, 2019). These assumptions are 
challenged by increasing evidence for XAI methods leading 
to biases in human-AI interaction, for example when humans
trust transparent ML predictions even when they are wrong 
(Poursabzi-Sangdeh et al., 2018; Schmidt et al., 2020). One 
reason for this bias could be the empirical
observation that humans tend to trust ML systems less than 
humans, even when they know that the ML system performs 
better (Dietvorst et al., 2015). These findings suggest that in 
order to calibrate the XAI system to the right level of trans- 
parency, it is important to consider how intuitive or human 
an explanation is. Here we present a quantitative metric in 
analogy to the imitation game (Turing Test) that directly cap- 
tures this effect. Our results demonstrate that some machine 
generated textual explanations cannot be differentiated from 
human generated explanations. Future work will explore 
this effect with more complex explanations. While passing the test
can be regarded as an achievement of XAI research, it also
raises ethical questions on the application of transparent ML 
methods. If humans are biased towards blindly following 
intuitive explanations we argue that considering humans’ 
ability to detect machine generated explanations is an im- 
portant factor for responsible usage of XAI methods. We 
hope that the proposed XAI quality metric can contribute to 
a better quantitative evaluation of transparent ML methods. 